Abstract

This is a short introduction to Kolmogorov complexity and information theory.
The interested reader is referred to the literature, especially the textbooks
[CT91] and [LV97], which cover the fields of information theory and Kolmogorov
complexity in depth and with all the necessary rigor. They are easy to read and
require only a minimum of prior knowledge.

Kolmogorov complexity. Kolmogorov complexity is also known as algorithmic
complexity and Turing complexity. Though Kolmogorov was not the first to
formulate the idea, he played the dominant role in the consolidation of the
theory. The concept itself was developed independently and with different
motivations by Andrei N. Kolmogorov [Kol65], Ray Solomonoff [Sol64], and
Gregory Chaitin [Cha66], [Cha69].

The Kolmogorov complexity $C(s)$ of any binary string $s \in \{0, 1\}^n$ is the
length of the shortest computer program $s^*$ that can produce this string on
the Universal Turing Machine (UTM) and then halt. In other words, on the UTM
$C(s)$ bits of information are needed to encode $s$. The UTM is not a real
computer but an imaginary reference machine. We do not need the specific
details of the UTM. As every Turing machine can be implemented on every other
one, the minimum length of a program on one machine will only add a constant to
the minimum length of the program on every other machine. This constant is the
length of the implementation of the first machine on the other machine and is
independent of the string in question. This was first observed in 1964 by Ray
Solomonoff.
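
The definition cannot be evaluated directly, but any lossless compressor yields
a computable upper bound on $C(s)$, up to the constant cost of its decompressor.
The following is a minimal Python sketch under that assumption. The use of zlib
as a stand-in for the reference machine and the name `compressed_bits` are
illustrative choices and not part of the theory.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of `data`.

    This is only an upper bound on C(s), up to the constant cost of the
    decompressor. It is not the Kolmogorov complexity itself.
    """
    return 8 * len(zlib.compress(data, 9))


# A highly regular string: 1000 repetitions of "01".
regular = b"01" * 1000

# A pseudo-random string of the same length (fixed seed for repeatability).
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(2000))

print("regular  :", 8 * len(regular), "bits raw,", compressed_bits(regular), "bits compressed")
print("irregular:", 8 * len(irregular), "bits raw,", compressed_bits(irregular), "bits compressed")
```

Note that the second string is produced by a short program (a generator and a
seed), so its true complexity is small even though zlib finds no regularity in
it. A compressor can only bound $C(s)$ from above.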



Experience has shown that every attempt to construct a theoretical model of
computation that is more powerful than the Turing machine has come up with
something that is at most just as strong as the Turing machine. This was
codified in 1936 by Alonzo Church as Church's Thesis: the class of
algorithmically computable numerical functions coincides with the class of
partial recursive functions. Everything we can compute we can compute by a
Turing machine, and what we cannot compute by a Turing machine we cannot compute
at all. This said, we can use Kolmogorov complexity as a universal measure
that will assign the same value to any sequence of bits regardless of the model
of computation, within the bounds of an additive constant.



*From The Paradox of Overfitting [Nan03].
Incomputability of Kolmogorov complexity. Kolmogorov complexity is
not computable. It is nevertheless essential for proving existence and bounds
for weaker notions of complexity. The fact that Kolmogorov complexity cannot
be computed stems from the fact that we cannot compute the output of every
program. More fundamentally, no algorithm is possible that can predict for
every program whether it will ever halt, as was shown by Alan Turing in his
famous work on the halting problem [Tur36]. No computer program is possible
that, when given any other computer program as input, will always output true
if that program will eventually halt and false if it will not. Even if we have a
short program that outputs our string and that seems to be a good candidate for
being the shortest such program, there are always a number of shorter programs
of which we do not know whether they will ever halt and with what output.

Plain versus prefix complexity. Turing's original model of computation
included special delimiters that marked the end of an input string. This has
resulted in two brands of Kolmogorov complexity:

plain Kolmogorov complexity: the length $C(s)$ of the shortest binary
string that is delimited by special marks and that can compute $s$ on
the UTM and then halt.

prefix Kolmogorov complexity: the length $K(s)$ of the shortest binary
string that is self-delimiting [LV97] and that can compute $s$ on the
UTM and then halt.

The difference between the two is logarithmic in $C(s)$: the number of extra
bits that are needed to delimit the input string. While plain Kolmogorov
complexity integrates neatly with the Turing model of computation, prefix
Kolmogorov complexity has a number of desirable mathematical characteristics
that make it a more coherent theory. The individual advantages and
disadvantages are described in [LV97]. Which one is actually used is a matter
of convenience. We will mostly use the prefix complexity $K(s)$.
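
To make the logarithmic overhead concrete, the following Python sketch turns an
arbitrary bit string into a self-delimiting one by prefixing its length in
Elias gamma code. This particular encoding and the function names are
assumptions chosen for illustration. The theory only requires that some
self-delimiting code exists.

```python
from typing import Tuple


def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of 0s and 1s."""
    if n < 1:
        raise ValueError("n must be positive")
    binary = bin(n)[2:]                      # binary expansion, leading bit is 1
    return "0" * (len(binary) - 1) + binary


def self_delimit(bits: str) -> str:
    """Prefix `bits` with the gamma code of len(bits) + 1.

    The +1 allows the empty string to be encoded as well.
    """
    return elias_gamma(len(bits) + 1) + bits


def read_one(stream: str) -> Tuple[str, str]:
    """Split a stream into its first self-delimited string and the remainder."""
    zeros = 0
    while stream[zeros] == "0":
        zeros += 1
    length = int(stream[zeros:2 * zeros + 1], 2) - 1
    start = 2 * zeros + 1
    return stream[start:start + length], stream[start + length:]


# Two strings concatenated without any end marker can still be separated.
stream = self_delimit("10110") + self_delimit("001")
first, rest = read_one(stream)
second, rest = read_one(rest)
print(first, second)
```

With this code a string of length $n$ costs $2\lfloor \log_2 (n+1) \rfloor + 1$
extra bits, which is logarithmic in the length of the string.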

Individual randomness. A. N. Kolmogorov was interested in Kolmogorov
complexity to define the individual randomness of an object. When $s$ has no
computable regularity it cannot be encoded by a program shorter than $s$. Such
a string is truly random and its Kolmogorov complexity is the length of the
string itself plus the command print.^1 And indeed, strings with a Kolmogorov
complexity close to their actual length satisfy all known tests of randomness. A
regular string, on the other hand, can be computed by a program much shorter
than the string itself. But the overwhelming majority of all strings of any
length are random, and for a string picked at random the chances are
exponentially small that its Kolmogorov complexity will be significantly
smaller than its actual length.

^1 Plus a logarithmic term if we use prefix complexity.

This can easily be shown. For any given integer $n$ there are exactly $2^n$
binary strings of that length and $2^n - 1$ strings that are shorter than $n$:
one empty string, $2^1$ strings of length one, $2^2$ of length two and so
forth. Even if all strings shorter than $n$ would produce a string of length
$n$ on the UTM we would still be one string short of assigning a $C(s) < n$ to
every single one of our $2^n$ strings. And if we want to assign a
$C(s) < n - 1$ we can do so for at most $2^{n-1} - 1$ strings. And for
$C(s) < n - 10$ we can only do so for $2^{n-10} - 1$ strings, which is less
than 0.1% of all our strings. Even under optimal circumstances we will never
find a $C(s) < n - c$ for more than $\frac{1}{2^c}$ of our strings.
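
The counting argument can be checked with exact arithmetic. The sketch below
computes the largest possible fraction of $n$-bit strings that could have
$C(s) < n - c$, assuming the best case in which every shorter program outputs a
distinct string of length $n$. The function name is hypothetical.

```python
from fractions import Fraction


def max_compressible_fraction(n: int, c: int) -> Fraction:
    """Upper bound on the fraction of n-bit strings with C(s) < n - c.

    There are 2**(n - c) - 1 binary programs shorter than n - c bits, and
    each program outputs at most one string.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    programs = 2 ** (n - c) - 1
    return Fraction(programs, 2 ** n)


for c in (0, 1, 10, 20):
    print(c, float(max_compressible_fraction(100, c)))
```

For every $c$ the bound stays strictly below $2^{-c}$, in agreement with the
argument above.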

Conditional Kolmogorov complexity. The conditional Kolmogorov complexity
$K(s|a)$ is defined as the length of the shortest program that can output $s$
on the UTM if the input string $a$ is given on an auxiliary tape. $K(s)$ is the
special case $K(s|\epsilon)$ where the auxiliary tape is empty.
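
A rough, computable analogue of the auxiliary tape is a compressor that is
given $a$ as a preset dictionary. The sketch below uses zlib for this purpose.
It is an illustration under that assumption and yields only an upper bound, not
$K(s|a)$ itself. The function names are hypothetical.

```python
import random
import zlib


def compressed_len(s: bytes) -> int:
    """Bytes zlib needs to encode s without side information."""
    return len(zlib.compress(s, 9))


def conditional_compressed_len(s: bytes, a: bytes) -> int:
    """Bytes zlib needs to encode s when `a` is available as a dictionary."""
    if not a:
        # Empty auxiliary input: fall back to the unconditional case.
        return compressed_len(s)
    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                            zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY, a)
    return len(comp.compress(s) + comp.flush())


# A string without obvious regularity (fixed seed for repeatability).
rng = random.Random(1)
a = bytes(rng.getrandbits(8) for _ in range(200))

print("without side information:", compressed_len(a), "bytes")
print("given a itself          :", conditional_compressed_len(a, a), "bytes")
```

When $s$ is already contained in $a$, almost nothing remains to be described,
which mirrors the fact that $K(s|s)$ is a small constant.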

The universal distribution. When Ray Solomonoff first developed Kolmogorov
complexity in 1964 he intended it to define a universal distribution
over all possible objects. His original approach dealt with a specific problem
of Bayes' rule, the unknown prior distribution. Bayes' rule can be used to
calculate $P(m|s)$, the probability for a probabilistic model to have generated
the sample $s$, given $s$. It is very simple. $P(s|m)$, the probability that
the sample will occur given the model, is multiplied by the unconditional
probability that the model will apply at all, $P(m)$. This is divided by the
unconditional probability of the sample $P(s)$:

$$P(m|s) = \frac{P(s|m)\,P(m)}{P(s)} \tag{1}$$

The unconditional probability of the model is called the prior distribution and
the probability that the model will have generated the data is called the
posterior distribution.

Bayes' rule can easily be derived from the definition of conditional
probability:

$$P(m|s) = \frac{P(m,s)}{P(s)} \tag{2}$$

and

$$P(s|m) = \frac{P(m,s)}{P(m)} \tag{3}$$



The big and obvious problem with Bayes' rule is that we usually have no idea
what the prior distribution $P(m)$ should be. Solomonoff suggested that if the
true prior distribution is unknown the best assumption would be the universal
distribution $2^{-K(m)}$ where $K(m)$ is the prefix Kolmogorov complexity of
the model.^2 This is nothing but a modern codification of the age-old principle
that is widely known under the name of Occam's razor: the simplest explanation
is the most likely one to be true.
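
As a small numerical illustration, the sketch below applies Bayes' rule with a
prior of the form $2^{-L(m)}$. Since $K(m)$ is not computable, $L(m)$ is an
assumed description length in bits that merely stands in for it. The two coin
models and their description lengths are hypothetical.

```python
from math import fsum


def posterior(models, sample):
    """Bayes' rule with a prior proportional to 2**(-code_length).

    `models` maps a name to (code_length_bits, p_one), where p_one is the
    probability that the model emits a 1. `sample` is a string of 0s and 1s.
    """
    ones = sample.count("1")
    zeros = len(sample) - ones
    joint = {}
    for name, (code_length, p_one) in models.items():
        prior = 2.0 ** (-code_length)                        # stands in for 2^{-K(m)}
        likelihood = p_one ** ones * (1.0 - p_one) ** zeros  # P(s|m)
        joint[name] = likelihood * prior                     # P(s|m) P(m)
    # The prior is not normalised here; the constant cancels in the ratio.
    evidence = fsum(joint.values())
    return {name: value / evidence for name, value in joint.items()}


# A fair coin with a short description and a biased coin whose parameter
# is assumed to cost more bits to write down.
models = {"fair": (2, 0.5), "biased": (8, 0.9)}
print(posterior(models, "1101"))
print(posterior(models, "1" * 40))
```

On the short sample the simpler model keeps the higher posterior. On the long
sample the evidence outweighs the penalty for the more complex model.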

^2 Originally Solomonoff used the plain Kolmogorov complexity $C(\cdot)$. This
resulted in an improper distribution $2^{-C(m)}$ that tends to infinity. Only
in 1974 did L. A. Levin introduce prefix complexity to solve this particular
problem, and thereby many other problems as well [Lev74].

A Short Introduction to Kolmogorov Complexity

Entropy. Claude Shannon [Sha48] developed information theory in the late
1940s. He was concerned with the optimum code length that could be given to
different binary words $w$ of a source string $s$. Obviously, assigning a short
code length to low frequency words or a long code length to high frequency
words is a waste of resources. Suppose we draw a word $w$ from our source
string $s$ uniformly at random. Then the probability $p(w)$ is equal to the
frequency of $w$ in $s$. Shannon found that the optimum overall code length for
$s$ was achieved when assigning to each word $w$ a code of length
$-\log p(w)$. Shannon attributed the original idea to R. M. Fano and hence this
code is called the Shannon-Fano code. When using such an optimal code, the
average code length of the words of $s$ can be reduced to

$$H(s) = -\sum_{w \in s} p(w) \log p(w) \tag{4}$$

where $H(s)$ is called the entropy of the set $s$. When $s$ is finite and we
assign a code of length $-\log p(w)$ to each of the $n$ words of $s$, the total
code length is

$$-\sum_{w \in s} \log p(w) = n H(s) \tag{5}$$
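
The relation between the average and the total code length can be verified on
a toy source. The sketch below uses base-2 logarithms and the empirical word
frequencies as $p(w)$. The example string and the function names are
illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(words):
    """Empirical entropy H(s) in bits per word of a sequence of words."""
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def total_code_length(words):
    """Sum of -log2 p(w) over all n word occurrences in the sequence."""
    counts = Counter(words)
    n = len(words)
    return -sum(log2(counts[w] / n) for w in words)


s = "a a a a b b c d".split()
print(entropy(s))            # average bits per word
print(total_code_length(s))  # equals len(s) * entropy(s)
```

Here the frequencies are 1/2, 1/4, 1/8 and 1/8, so the code lengths are 1, 2, 3
and 3 bits, the entropy is 1.75 bits per word, and the eight words together
take 14 bits.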

Let $s$ be the outcome of some random process $W$ that produces the words
$w \in s$ sequentially and independently, each with some known probability
$p(W = w) > 0$. $K(s|W)$ is the Kolmogorov complexity of $s$ given $W$. Because
the Shannon-Fano code is optimal, the probability that $K(s|W)$ is
significantly less than $nH(W)$ is exponentially small. This makes the negative
log likelihood of $s$ given $W$ a good estimator of $K(s|W)$:

$$\begin{aligned}
K(s|W) &\approx nH(W) \\
       &\approx -\sum_{w \in s} \log p(w|W) \\
       &= -\log p(s|W)
\end{aligned} \tag{6}$$



Relative entropy. The relative entropy $D(p\|q)$ tells us what happens when
we use the wrong probability to encode our source string $s$. If $p(w)$ is the
true distribution over the words of $s$ but we use $q(w)$ to encode them, we
end up with an average of $H(p) + D(p\|q)$ bits per word. $D(p\|q)$ is also
called the Kullback-Leibler distance between the two probability mass
functions $p$ and $q$. It is defined as

$$D(p\|q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)} \tag{7}$$
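
The statement that coding with the wrong distribution costs
$H(p) + D(p\|q)$ bits per word can be checked numerically. The sketch below
uses base-2 logarithms. The two distributions and the function names are
hypothetical examples.

```python
from math import log2


def dist_entropy(p):
    """Entropy in bits of a distribution given as {word: probability}."""
    return -sum(pw * log2(pw) for pw in p.values() if pw > 0)


def relative_entropy(p, q):
    """D(p||q) in bits; infinite if q gives zero mass where p does not."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0:
            continue
        qw = q.get(w, 0.0)
        if qw == 0:
            return float("inf")
        total += pw * log2(pw / qw)
    return total


def average_code_length(p, q):
    """Expected bits per word when words follow p but codes have length
    -log2 q(w). Assumes q(w) > 0 wherever p(w) > 0."""
    return -sum(pw * log2(q[w]) for w, pw in p.items() if pw > 0)


p = {"a": 0.5, "b": 0.25, "c": 0.25}   # true distribution
q = {"a": 0.25, "b": 0.25, "c": 0.5}   # distribution used for the code
print(dist_entropy(p), relative_entropy(p, q), average_code_length(p, q))
```

In this example $H(p) = 1.5$ bits and $D(p\|q) = 0.25$ bits, so the mismatched
code needs 1.75 bits per word on average.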

Fisher information. Fisher information was introduced into statistics some
20 years before C. Shannon introduced information theory [Fis25]. But it was
not well understood without it. Fisher information is the variance of the score
$V$ of the continuous parameter space of our models $m_k$. This needs some
explanation. At the beginning of this thesis we defined models as binary
strings that discretize the parameter space of some function or probability
distribution. For the purpose of Fisher information we have to temporarily
treat a model $m_k$ as a vector in $\mathbb{R}^k$. And we only consider models
where for all samples $s$ the mapping $f_s(m_k)$ defined by
$f_s(m_k) = p(s|m_k)$ is differentiable. Then the score $V$ can be defined as

$$V = \frac{\partial}{\partial m_k} \ln p(s|m_k)
    = \frac{\frac{\partial}{\partial m_k} p(s|m_k)}{p(s|m_k)} \tag{8}$$

The score $V$ is the partial derivative of $\ln p(s|m_k)$, a term we are
already familiar with. The Fisher information $J(m_k)$ is

$$J(m_k) = E_{m_k}\left[ \left( \frac{\partial}{\partial m_k} \ln p(s|m_k) \right)^2 \right] \tag{9}$$



Intuitively, a high Fisher information means that slight changes to the
parameters will have a great effect on $p(s|m_k)$. If $J(m_k)$ is high we must
calculate $p(s|m_k)$ to a high precision. Conversely, if $J(m_k)$ is low, we
may round $p(s|m_k)$ to a low precision.
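
For a concrete case, consider a model with a single parameter $\theta$, the
probability that a coin shows 1. This Bernoulli model is an illustrative
assumption and not one of the models discussed above. The sketch computes the
expected squared score by summing over both outcomes and compares it with the
known closed form $1 / (\theta (1 - \theta))$.

```python
def score(x: int, theta: float) -> float:
    """Derivative with respect to theta of ln p(x|theta), x in {0, 1}."""
    return x / theta - (1 - x) / (1.0 - theta)


def mean_score(theta: float) -> float:
    """E[score] under p(.|theta). It is zero, so the variance of the
    score equals the expected squared score."""
    return theta * score(1, theta) + (1.0 - theta) * score(0, theta)


def fisher_information(theta: float) -> float:
    """E[score^2] under p(.|theta), summed over the two outcomes."""
    return theta * score(1, theta) ** 2 + (1.0 - theta) * score(0, theta) ** 2


for theta in (0.01, 0.1, 0.5):
    print(theta, fisher_information(theta), 1.0 / (theta * (1.0 - theta)))
```

The information is lowest at $\theta = 0.5$, where it equals 4, and grows
without bound as $\theta$ approaches 0 or 1. Near the boundary a small change
of the parameter changes the likelihood of a sample considerably, so the
parameter has to be stated more precisely there.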

Kolmogorov complexity of sets. The Kolmogorov complexity of a set of
strings $S$ is the length of the shortest program that can output the members
of $S$ on the UTM and then halt. If one is to approximate some string $s$ with
$\alpha < K(s)$ bits then the best one can do is to compute the smallest set
$S$ with $K(S) < \alpha$ that includes $s$. Once we have some $S \ni s$ we need
at most $\log |S|$ additional bits to compute $s$. This set $S$ is defined by
the Kolmogorov structure function

$$h_s(\alpha) = \min_{S} \left[ \log |S| : S \ni s,\ K(S) < \alpha \right] \tag{10}$$

which has many interesting features. The function $h_s(\alpha) + \alpha$ is
nonincreasing and never falls below the line $K(s) + O(1)$ but can assume any
form within these constraints. It should be evident that

$$h_s(\alpha) > K(s) - K(S) \tag{11}$$

Kolmogorov complexity of distributions. The Kolmogorov structure function
is not confined to finite sets. If we generalize $h_s(\alpha)$ to probabilistic
models $m_p$ that define distributions over $\mathbb{R}$ and if we let $s$
describe a real number, we obtain

$$h_s(\alpha) = \min_{m_p} \left[ -\log p(s|m_p) : p(s|m_p) > 0,\ K(m_p) < \alpha \right] \tag{12}$$

where $-\log p(s|m_p)$ is the number of bits we need to encode $s$ with a code
that is optimal for the distribution defined by $m_p$. Henceforth we will write
$m_p$ when the model defines a probability distribution and $m_k$ with
$k \in \mathbb{N}$ when the model defines a probability distribution that has
$k$ parameters. A set $S$ can be viewed as a special case of $m_p$, a uniform
distribution with



$$p(s|m_p) = \begin{cases} \frac{1}{|S|} & \text{if } s \in S \\ 0 & \text{if } s \notin S \end{cases} \tag{13}$$



Minimum randomness deficiency. The randomness deficiency of a string $s$
with regard to a model $m_p$ is defined as

$$\delta(s|m_p) = -\log p(s|m_p) - K(s|m_p, K(m_p)) \tag{14}$$



for $p(s|m_p) > 0$, and $\infty$ otherwise. This is a generalization of the
definition given in [VV02] where models are finite sets. If $\delta(s|m_p)$ is
small, then $s$ may be considered a typical or low profile instance of the
distribution. $s$ satisfies all properties of low Kolmogorov complexity that
hold with high probability for the support set of $m_p$. This would not be the
case if $s$ were exactly identical to the mean, first moment or any other
special characteristic of $m_p$.

Randomness deficiency is a key concept to any application of Kolmogorov
complexity. As we saw earlier, Kolmogorov complexity and conditional Kolmogorov
complexity are not computable. We can never claim that a particular string $s$
does have a conditional Kolmogorov complexity

$$K(s|m_p) \approx -\log p(s|m_p) \tag{15}$$



The technical term that defines all those strings that do satisfy this
approximation is typicality, defined as a small randomness deficiency
$\delta(s|m_p)$.



Minimum randomness deficiency turns out to be important for lossy data
compression. A compressed string of minimum randomness deficiency is the most
difficult one to distinguish from the original string. The best lossy
compression that uses a maximum of $\alpha$ bits is defined by the minimum
randomness deficiency function

$$\beta_s(\alpha) = \min_{m_p} \left[ \delta(s|m_p) : p(s|m_p) > 0,\ K(m_p) < \alpha \right] \tag{16}$$



Minimum Description Length. The Minimum Description Length, or MDL for
short, of a string $s$ is the length of the shortest two-part code for $s$ that
uses less than $\alpha$ bits. It consists of the number of bits needed to
encode the model $m_p$ that defines a distribution and the negative log
likelihood of $s$ under this distribution.

$$\lambda_s(\alpha) = \min_{m_p} \left[ -\log p(s|m_p) + K(m_p) : p(s|m_p) > 0,\ K(m_p) < \alpha \right] \tag{17}$$






It has recently been shown by Nikolai Vereshchagin and Paul Vitanyi in [VV02]
that a model that minimizes the description length also minimizes the
randomness deficiency, though the reverse may not be true. The most fundamental
result of that paper is the equality

$$\beta_s(\alpha) = h_s(\alpha) + \alpha - K(s) = \lambda_s(\alpha) - K(s) \tag{18}$$

where the mutual relations between the Kolmogorov structure function, the
minimum randomness deficiency and the minimum description length are pinned
down, up to logarithmic additive terms in argument and value.